Robert Turner, University of Sheffield RSE Team, September 2021
Heavily based on Reproducible Research Data and Project Management in R, by Anna Krystalli and Methods in Research Software Engineering by David Wilby.
Mix of software engineering and research experience.
13 RSEs, 35 projects / year worth ~£11m total
Practical advice on:
What operating system(s) do you use?
What programming language(s) do you use?
Act as though every short term study will become a long term one @tomjwebb. Needs to be reproducible in 3, 20, 100 yrs
— Oceans Initiative (@oceansresearch) January 16, 2015
Take initiative & responsibility. Think long term.
@tomjwebb stay away from excel at all costs?
— Timothée Poisot (@tpoi) January 16, 2015
Do you agree?
THRILLED by this announcement by the Human Gene Nomenclature Committee. pic.twitter.com/BqLIOMm69d
— Janna Hutz (@jannahutz) August 4, 2020
But good for data viewing / entry, sometimes, perhaps…
@tomjwebb Entering via a database management system (e.g., Access, Filemaker) can make entry easier & help prevent data entry errors @tpoi
— Ethan White (@ethanwhite) January 16, 2015
@ethanwhite +1 Enforcing data types, options from selection etc, just some useful things a DB gives you, if you turn them on @tomjwebb @tpoi
— Gavin Simpson (@ucfagls) January 16, 2015
@tomjwebb it also prevents a lot of different bad practices. It is possible to do some of this in Excel. @tpoi
— Ethan White (@ethanwhite) January 16, 2015
Have a look at the Data Carpentry SQL for Ecology lesson
.csv: comma separated values.
.tsv: tab separated values.
.txt: no formatting specified.
@tomjwebb It has to be interoperability/openness - can I read your data with whatever I use, without having to convert it?
— Paul Swaddle (@paul_swaddle) January 16, 2015
More unusual formats will need instructions on use.
Andrea De Santis, unsplash.com
.csv or .tsv copy would need to be saved.
Use good null values; missing values are a fact of life:
NA or NULL are also good options.
Do not use 0. Avoid numbers like -999.
@tomjwebb don't, not even with a barge pole, not for one second, touch or otherwise edit the raw data files. Do any manipulations in script
— Gavin Simpson (@ucfagls) January 16, 2015
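Going back to the null-value advice for a moment, here is a minimal shell sketch of why a sentinel such as -999 is risky. The readings are made up for illustration, and `column_mean` is a hypothetical helper, not part of any tool mentioned in this talk.

```shell
# Made-up temperature readings; the third one is missing.
with_na='site,temp
a,3.0
b,5.0
c,NA'
with_sentinel='site,temp
a,3.0
b,5.0
c,-999'

# Mean of column 2, skipping the header row and any NA or blank field.
column_mean() {
  awk -F, 'NR > 1 && $2 != "NA" && $2 != "" { sum += $2; n++ }
           END { if (n > 0) printf "%.2f\n", sum / n }'
}

mean_na=$(printf '%s\n' "$with_na" | column_mean)
mean_sentinel=$(printf '%s\n' "$with_sentinel" | column_mean)

echo "NA skipped:      $mean_na"
echo "-999 as a value: $mean_sentinel"
```

NA is recognisably "not a number", so it is easy to skip. The -999 looks like a real reading and silently drags the mean far below any real measurement.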
@tomjwebb @srsupp Keep one or a few good master data files (per data collection of interest), and code your formatting with good annotation.
— Desiree Narango (@DLNarango) January 16, 2015
Raw data are sacrosanct
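One possible way to enforce this, sketched in the shell: remove write permission from the raw file and do all work on a copy. The directory and file names below are assumptions for illustration, and the demo runs in a temporary directory so it does not touch real data.

```shell
# Work in a throwaway directory for the demonstration.
demo=$(mktemp -d)
mkdir -p "$demo/01_raw" "$demo/02_working"

# A made-up raw data file.
printf 'id,value\n1,3.2\n2,4.1\n' > "$demo/01_raw/measurements.csv"

# Remove write permission for everyone: the raw file is now read-only.
chmod a-w "$demo/01_raw/measurements.csv"

# Manipulate a working copy instead, and make that copy writable.
cp "$demo/01_raw/measurements.csv" "$demo/02_working/measurements.csv"
chmod u+w "$demo/02_working/measurements.csv"

ls -l "$demo/01_raw" "$demo/02_working"
```

Read-only permissions guard against accidental edits only; they are not a substitute for a backup.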
Photo by Jon Moore, unsplash.com
Photo: Pexels CC0
main copy of files
@tomjwebb Back it up
— Ben Bond-Lamberty (@BenBondLamberty) January 16, 2015
NO
myabstract.docx
Joe’s Filenames Use Spaces and Punctuation.xlsx
figure 1.png
fig 2.png
JW7d^(2sl@deletethisandyourcareerisoverWx2*.txt
YES
2014-06-08_abstract-for-sla.docx
joes-filenames-are-getting-better.xlsx
fig01_scatterplot-talk-length-vs-interest.png
fig02_histogram-talk-attendance.png
1986-01-28_raw-data-from-challenger-o-rings.txt
What makes a good file name?
In the following:
ls -lh *Plasmid*
*Plasmid* is a glob: a pattern that the shell expands to every file name containing "Plasmid".
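A small sketch of the glob in action. The file names are invented for illustration, and everything happens in a temporary directory.

```shell
# Create a few made-up files in a throwaway directory.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch 2013-06-26_Plasmid-A01.csv 2013-06-26_Plasmid-A02.csv 2013-06-26_Control-A03.csv

# The shell expands *Plasmid* before ls runs, so ls only sees the two matches.
ls -lh *Plasmid*

# The same glob works anywhere the shell expects file names, e.g. in a loop.
count=0
for f in *Plasmid*; do
  count=$((count + 1))
done
echo "matched $count files"
```

Consistent file names are what make this useful: one pattern selects exactly the files you mean.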
Deliberate use of "-" and "_" allows recovery of metadata from the filenames:
"_" underscore used to delimit units of metadata I want to access later
"-" hyphen used to delimit words so our eyes don’t bleed
This happens to be R but also possible in the shell, Python, etc.
e.g. I’m saving a number of files of temperature data extracted at different resolutions (res) and for a number of months (month). Including these parameters in the filename allows me to use them to target files to read in.
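# res and month are parameters set earlier; both become part of the file name
write.csv(df, paste0(paste("variable", res, month, sep = "_"), ".csv"), row.names = FALSE)
df <- read.csv(paste0(paste("variable", res, month, sep = "_"), ".csv"))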
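The same idea sketched in the shell, going in the other direction: splitting a file name on "_" to get the metadata back. The first name is one of the "YES" examples above; the second, `temperature_0.5deg_07.csv`, is a hypothetical name following the variable_res_month pattern.

```shell
# A name from the "YES" list: <id>_<description>.png
fig="fig01_scatterplot-talk-length-vs-interest.png"
fig_base="${fig%.png}"                                   # drop the extension
fig_id=$(printf '%s\n' "$fig_base" | cut -d_ -f1)        # first "_" unit
fig_desc=$(printf '%s\n' "$fig_base" | cut -d_ -f2)      # second "_" unit, hyphens intact
echo "id=$fig_id description=$fig_desc"

# A hypothetical data file: <variable>_<res>_<month>.csv
file="temperature_0.5deg_07.csv"
base="${file%.csv}"
variable=$(printf '%s\n' "$base" | cut -d_ -f1)
res=$(printf '%s\n' "$base" | cut -d_ -f2)
month=$(printf '%s\n' "$base" | cut -d_ -f3)
echo "variable=$variable res=$res month=$month"
```

This only works because "_" never appears inside a unit of metadata, which is why words within a unit are joined with "-".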
01_marshal-data.r
02_pre-dea-filtering.r
03_dea-with-limma-voom.r
04_explore-dea-results.r
90_limma-model-term-name-fiasco.r
02_pre-dea-filtering-preDE-filtering.png
03_dea-with-limma-voom-voom-plot.png
04_explore-dea-results-focus-term-adjusted-p-values1.png
04_explore-dea-results-focus-term-adjusted-p-values2.png
...
90_limma-model-term-name-fiasco-first-voom.png
90_limma-model-term-name-fiasco-second-voom.png
Use the ISO 8601 standard for dates: YYYY-MM-DD
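A short shell sketch of why this format is worth the effort: the date can be generated automatically, and a plain text sort puts the files in chronological order. The shortened file names are made up for illustration.

```shell
# Build a file name that starts with today's date in YYYY-MM-DD form.
today=$(date +%Y-%m-%d)
new_name="${today}_abstract-for-talk.docx"
echo "$new_name"

# Year first, then month, then day: alphabetical order equals chronological order.
sorted=$(printf '%s\n' \
  2014-06-08_abstract.docx \
  1986-01-28_raw-data.txt \
  2014-01-15_notes.txt | LC_ALL=C sort)
echo "$sorted"
```

Formats such as DD-MM-YYYY do not have this property, because the day would be compared first.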
If you don’t left pad, you get this:
10_final-figs-for-publication.R
1_data-cleaning.R
2_fit-model.R
which is just sad :(
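To see the difference for yourself, here is a minimal sketch that sorts the same three script names with and without left padding, and uses `printf` to generate a padded prefix.

```shell
# Without padding, names are compared character by character, so "10_" lands before "1_".
unpadded=$(printf '%s\n' \
  1_data-cleaning.R \
  2_fit-model.R \
  10_final-figs-for-publication.R | LC_ALL=C sort)
echo "$unpadded"

# With two-digit prefixes the listing follows the intended order.
padded=$(printf '%s\n' \
  10_final-figs-for-publication.R \
  02_fit-model.R \
  01_data-cleaning.R | LC_ALL=C sort)
echo "$padded"

# printf can produce the padded prefix for you.
prefix=$(printf '%02d' 2)
echo "${prefix}_fit-model.R"
```

Two digits are enough for up to 99 scripts; choose the width before you start numbering.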
Go forth and use awesome file names :)
Where shall I put my data?
myproject/
|
├── 01_data/
|   ├── 01_raw/
|   ├── 02_working/
|   └── 03_clean/
|
├── 02_scripts/
|
├── 03_charts/
|
├── 04_paper/
|
└── 05_presentation/
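The layout above can be created in one go. This is a sketch that builds it inside a temporary directory (an assumption made so the demo is harmless); in practice you would set `project` to wherever your project lives.

```shell
# Build the skeleton in a throwaway location for the demonstration.
root=$(mktemp -d)
project="$root/myproject"

# -p creates parent directories as needed and does not complain if they exist.
mkdir -p \
  "$project/01_data/01_raw" \
  "$project/01_data/02_working" \
  "$project/01_data/03_clean" \
  "$project/02_scripts" \
  "$project/03_charts" \
  "$project/04_paper" \
  "$project/05_presentation"

# List what was created.
find "$project" -type d | LC_ALL=C sort
```

Keeping this in a small script means every new project starts with the same structure.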
R (rrtools)
analysis/
|
├── paper/
│   ├── paper.Rmd       # this is the main document to edit
│   └── references.bib  # this contains the reference list information
│
├── figures/            # location of the figures produced by the Rmd
|
├── data/
│   ├── raw_data/       # data obtained from elsewhere
│   └── derived_data/   # data generated during the analysis
|
└── templates
    ├── journal-of-archaeological-science.csl
    |                   # this sets the style of citations & reference list
    ├── template.docx   # used to style the output of the paper.Rmd
    └── template.Rmd